add files to compute basic stats on pseudo crawl dataset #382
SaulLu wants to merge 8 commits into bigscience-workshop:master
Conversation
Force-pushed from 96fcc41 to 19c004e
thomasw21 left a comment
LGTM! A few comments but should be good.
    parser.add_argument(
        "--save-batch-size", type=int, required=True, help="Batch size when writing."
    )
    parser.add_argument("--use-datasets-caching", action="store_true")
We never use it. Let's remove it.
In fact, it seems to me that I used use_datasets_caching
I'll remove save-batch-size
| "--save-path-stats-json", type=str, help="Where to save the stats json." | ||
| ) | ||
| parser.add_argument( | ||
| "--save-path-stats-full-json", type=str, help="Where to save the stats json." |
| "--save-path-stats-full-json", type=str, help="Where to save the stats json." | |
| "--save-path-stats-full-json", type=str, required=True, help="Where to save the stats json." |
| logger.info(f" --- Statistics not already computed for seed id {args.seed_id} ") | ||
| if not args.use_datasets_caching: | ||
| datasets.set_caching_enabled(False) | ||
| else: | ||
| logger.info( | ||
| f"the datasets results will be cached at {config.HF_DATASETS_CACHE}." | ||
| ) |
| logger.info(f" --- Statistics not already computed for seed id {args.seed_id} ") | |
| if not args.use_datasets_caching: | |
| datasets.set_caching_enabled(False) | |
| else: | |
| logger.info( | |
| f"the datasets results will be cached at {config.HF_DATASETS_CACHE}." | |
| ) | |
| logger.info(f" --- Statistics not already computed for seed id {args.seed_id} ") |
In fact, it seems to me that I used use_datasets_caching
    ]
    ds_html = ds_html.map(
        get_length_text,
        batched=False,
Suggested change:
    -        batched=False,
    +        batched=True,
You just have to code a batched version of get_length_text
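For illustration, a batched variant might look like the sketch below. The actual get_length_text is not shown in this thread, so the column names ("text", "length_text") and the num_proc usage here are assumptions, not the PR's code.

    # Hypothetical sketch of a batched get_length_text; the column names
    # "text" and "length_text" are assumed, the PR's function may differ.
    def get_length_text(batch):
        batch["length_text"] = [
            len(text) if text is not None else 0 for text in batch["text"]
        ]
        return batch

    # With batched=True, datasets.Dataset.map passes a dict mapping column
    # names to lists of values, so one call processes a whole batch of rows.
    ds_html = ds_html.map(get_length_text, batched=True, num_proc=args.num_proc)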
    with open(save_path_tmp, "w", encoding="utf-8") as f:
        json.dump(data_stats, f, ensure_ascii=False, indent=4)
    logger.info(f"Moving the saved dataset to {str(save_path.absolute())}")
    subprocess.run(["mv", save_path_tmp, str(save_path.absolute())])
Yeah, I'd recommend os.rename. subprocess doesn't tell you if it fails.
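As a minimal sketch of that swap, assuming the same save_path / save_path_tmp variables as in the diff above (os.replace is used here as the overwrite-allowing variant of os.rename; the exact call is up to the author):

    import os

    # Unlike subprocess.run(["mv", ...]), which only signals failure through
    # its returncode unless check=True is passed, os.replace raises OSError
    # when the move cannot be performed.
    os.replace(save_path_tmp, str(save_path.absolute()))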
    save_path = Path(args.save_path_stats_full_json)
    tmp_file_name = f"tmp-{str(save_path.name)}"
    save_path_tmp = os.path.join(save_path.parent, tmp_file_name)
    logger.info(f"Saving the dataset at {save_path_tmp}")
    ds_html.to_json(
        save_path_tmp,
        batch_size=args.save_batch_size,
        num_proc=args.num_proc,
        compression="gzip",
    )
    logger.info(f"Moving the saved dataset to {str(save_path.absolute())}")
Suggested change:
    -    save_path = Path(args.save_path_stats_full_json)
    -    tmp_file_name = f"tmp-{str(save_path.name)}"
    -    save_path_tmp = os.path.join(save_path.parent, tmp_file_name)
    -    logger.info(f"Saving the dataset at {save_path_tmp}")
    -    ds_html.to_json(
    -        save_path_tmp,
    -        batch_size=args.save_batch_size,
    -        num_proc=args.num_proc,
    -        compression="gzip",
    -    )
    -    logger.info(f"Moving the saved dataset to {str(save_path.absolute())}")
    +    save_path_full = Path(args.save_path_stats_full_json)
    +    save_path_full_tmp = save_path_full.rename(f"{save_path_full.name}.tmp")
    +    logger.info(f"Saving the dataset at {save_path_full_tmp}")
    +    ds_html.to_json(
    +        save_path_full_tmp,
    +        batch_size=args.save_batch_size,
    +        num_proc=args.num_proc,
    +        compression="gzip",
    +    )
    +    logger.info(f"Moving the saved dataset to {str(save_path_full.absolute())}")
Two things:
- move with os.rename or shutil.move
- You're not scared of overwriting the previous full_json? (see the sketch below)
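One possible guard against silently overwriting the previous full_json, sketched here with shutil.move and the save_path_full / save_path_full_tmp names from the suggestion above. This is illustrative only, not the PR's code:

    import shutil

    # Illustrative guard: fail loudly instead of clobbering an existing
    # stats file, then move the temporary file into place. shutil.move
    # raises an exception on failure, unlike an unchecked `mv` subprocess.
    if save_path_full.exists():
        raise FileExistsError(f"{save_path_full} already exists, refusing to overwrite")
    shutil.move(str(save_path_full_tmp), str(save_path_full))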
    #SBATCH --partition=cpu_p1
    #SBATCH --time 10:00:00    # maximum execution time (HH:MM:SS)
    #SBATCH --output=/gpfswork/rech/six/uty16tp/code/big_science/logs/compute_stats_v5/%x-%j.out    # output file name
    #SBATCH --array=1-604
You need to add the new seeds. And I need to update the csv with the new seeds ....
@SaulLu can you edit and merge? Thanks!
@HugoLaurencon, do you need it urgently?
@SaulLu No, no worries! It was just to tag you in case you didn't see the comments from Thomas.
As discussed with @thomasw21, this PR adds basic Slurm and Python scripts to compute an intermediary metadata dataset and some statistics for the Pseudo Crawl dataset.